NCDSearch: Sliding Window-Based Code Clone Search Using Lempel-Ziv Jaccard Distance

Authors

Abstract

Software developers may write a number of similar source code fragments including the same mistake in software products. To remove such faulty code fragments, developers inspect code clones if they found a bug in their code. While various code clone detection methods have been proposed to identify clones of either code blocks or functions, those tools do not always fit the code inspection task because a faulty code fragment may be much smaller than code blocks, e.g. a single line of code. To enable developers to search for code clones of such a small faulty code fragment in a large-scale software product, we propose a method using Lempel-Ziv Jaccard Distance, which is an approximation of Normalized Compression Distance. We conducted an experiment using an existing research dataset and a user survey in a company. The result shows that our method efficiently reports cloned faulty code fragments and that its performance is acceptable for software developers.
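The idea in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the commonly described form of Lempel-Ziv Jaccard Distance: each input is parsed into the set of previously unseen substrings, and the distance is one minus the Jaccard similarity of the two sets. The function names `lz_set`, `lzjd`, and `search`, the byte-level fixed-width window, and the `threshold` parameter are all hypothetical choices made for illustration; the min-hash approximation used in practical LZJD implementations is omitted.

```python
def lz_set(data: bytes) -> set:
    """Parse data into its set of Lempel-Ziv phrases (previously unseen substrings)."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            # New phrase: record it and start the next phrase after it.
            seen.add(chunk)
            start = end
    return seen


def lzjd(a: bytes, b: bytes) -> float:
    """Distance in [0, 1]: one minus the Jaccard similarity of the phrase sets."""
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def search(query: bytes, text: bytes, threshold: float = 0.5) -> list:
    """Slide a window of len(query) bytes over text; report (offset, distance) hits.

    Assumption: a fixed-width byte window. A real tool would more likely
    work on tokens and try several window sizes.
    """
    width = len(query)
    hits = []
    for offset in range(max(len(text) - width, 0) + 1):
        distance = lzjd(query, text[offset:offset + width])
        if distance <= threshold:
            hits.append((offset, distance))
    return hits


if __name__ == "__main__":
    source = b"int a = 0;\nx = y + 1;\nreturn a;\n"
    for offset, distance in search(b"x = y + 1;", source, threshold=0.3):
        print(offset, round(distance, 3))
```

Because the query can be as short as a single line, the window is sized from the query rather than from syntactic units such as blocks or functions, which is the gap in block-level clone detectors that the abstract points out.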


Similar resources

Lempel-Ziv Compression in a Sliding Window

We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem. Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w + z log log σ) deterministic time. Here, w is the window size, n is the size of the input string, z is the ...


Lempel-Ziv Jaccard Distance, an Effective Alternative to Ssdeep and Sdhash

Recent work has proposed the Lempel-Ziv Jaccard Distance (LZJD) as a method to measure the similarity between binary byte sequences for malware classification. We propose and test LZJD’s effectiveness as a similarity digest hash for digital forensics. To do so we develop a high performance Java implementation with the same command-line arguments as sdhash, making it easy to integrate into exist...


Lempel-Ziv Factorization: LZ77 without Window

To construct the suffix array of a string S boils down to sorting all suffixes of S in lexicographic order (also known as alphabetical order, dictionary order, or lexical order). This order is induced by an order on the alphabet Σ. In this manuscript, Σ is an ordered alphabet of constant size σ. It is sometimes convenient to regard Σ as an array of size σ so that the characters appear in ascending ...


Lempel-Ziv Dimension for Lempel-Ziv Compression

This paper describes the Lempel-Ziv dimension (Hausdorff like dimension inspired in the LZ78 parsing), its fundamental properties and relation with Hausdorff dimension. It is shown that in the case of individual infinite sequences, the Lempel-Ziv dimension matches with the asymptotical Lempel-Ziv compression ratio. This fact is used to describe results on Lempel-Ziv compression in terms of dime...


On Match Lengths, Zero Entropy and Large Deviations - with Application to Sliding Window Lempel-Ziv Algorithm

The Sliding Window Lempel-Ziv (SWLZ) algorithm that makes use of recurrence times and match lengths has been studied from various perspectives in information theory literature. In this paper, we undertake a finer study of these quantities under two different scenarios, i) zero entropy sources that are characterized by strong long-term memory, and ii) the processes with weak memory as described ...



Journal

Journal title: IEICE Transactions on Information and Systems

Year: 2022

ISSN: 0916-8532, 1745-1361

DOI: https://doi.org/10.1587/transinf.2021edp7222